DBSCAN — Density-Based Clustering (Finding crowds instead of shapes)#
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that looks for crowds: regions where points are packed together, separated by regions that are sparse.
That density-first view gives you two superpowers:
arbitrary shapes (clusters don’t have to be “ball-shaped”)
noise detection (points that don’t belong anywhere can be labeled as noise)
The cost: DBSCAN is sensitive to a distance scale (eps) and can struggle when clusters have very different densities.
DBSCAN is also commonly used after dimensionality reduction (PCA/UMAP/t-SNE): first embed to a space where distance is meaningful, then run a density-based clusterer.
Learning goals#
By the end you should be able to:
explain DBSCAN as “finding crowds instead of shapes”
define ε-neighborhoods and distinguish core / border / noise points
describe density reachability and the cluster expansion process
see (with Plotly) how `eps` and `min_samples` change the result
compare DBSCAN with k-means and HDBSCAN (and know when to use which)
Notation (quick)#
Dataset: `X = {x1, …, xn}` with `xi ∈ R^d`
Distance metric: `dist(x, z)` (we’ll use Euclidean unless stated)
ε-neighborhood of point `p`: `Nε(p) = { q ∈ X : dist(p, q) ≤ ε }`
`min_samples`: minimum number of points in `Nε(p)` (usually counting `p` itself) for `p` to be a core point
DBSCAN labels:
cluster IDs: `0, 1, 2, ...`
noise: `-1` (in scikit-learn)
Table of contents#
Intuition: finding crowds instead of shapes
Density concepts: ε-neighborhood, core/border/noise
Algorithm steps: density reachability + cluster expansion
DBSCAN from scratch (NumPy)
Plotly lab: effect of `eps` and `min_samples`
Plotly lab: noise detection + choosing `eps`
Comparison: DBSCAN vs k-means vs HDBSCAN
scikit-learn DBSCAN: practical parameter map
Practical checklist + pitfalls
Exercises
References
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.colors import qualitative
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons, make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Intuition: “finding crowds instead of shapes”#
Many clustering algorithms start by assuming a shape.
k-means assumes clusters look like spheres (in your feature space): “assign points to the closest center.”
DBSCAN assumes clusters look like crowds: “a cluster is where you can keep walking from point to nearby point without ever crossing a sparse desert.”
That’s why DBSCAN can discover non-convex clusters (like two interleaving moons) and also label outliers as noise.
# A classic non-convex dataset + some uniform noise
X_moons, _ = make_moons(n_samples=450, noise=0.06, random_state=42)
noise = rng.uniform(
low=X_moons.min(axis=0) - 0.6,
high=X_moons.max(axis=0) + 0.6,
size=(60, 2),
)
X = np.vstack([X_moons, noise])
fig = px.scatter(x=X[:, 0], y=X[:, 1], title="Two moons + uniform noise (no clustering yet)")
fig.update_traces(marker=dict(size=6, opacity=0.85))
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
2) Density concepts: ε-neighborhood, core/border/noise#
DBSCAN is built from one local question:
“Around this point, is it crowded?”
ε-neighborhood#
Pick a radius ε (epsilon). The ε-neighborhood of a point p is all points within distance ε.
Core / border / noise points#
Given ε and min_samples:
Core point: has at least `min_samples` points in its ε-neighborhood.
Border point: not core, but lies inside the ε-neighborhood of a core point.
Noise point: neither core nor border.
Let’s visualize a single ε-neighborhood as a circle.
def circle_xy(center_xy: np.ndarray, radius: float, n: int = 200) -> tuple[np.ndarray, np.ndarray]:
t = np.linspace(0, 2 * np.pi, n)
cx, cy = center_xy
return cx + radius * np.cos(t), cy + radius * np.sin(t)
eps_demo = 0.25
p_idx = 25
p = X[p_idx]
dists = np.linalg.norm(X - p, axis=1)
nbr_mask = dists <= eps_demo
fig = go.Figure()
# All points (faded)
fig.add_trace(
go.Scatter(
x=X[:, 0],
y=X[:, 1],
mode="markers",
name="all points",
marker=dict(size=6, color="lightgray", opacity=0.5),
)
)
# Neighbors within eps
fig.add_trace(
go.Scatter(
x=X[nbr_mask, 0],
y=X[nbr_mask, 1],
mode="markers",
name=f"Nε(p) (ε={eps_demo})",
marker=dict(size=7, color=qualitative.Set2[1], opacity=0.9),
)
)
# The chosen point p
fig.add_trace(
go.Scatter(
x=[p[0]],
y=[p[1]],
mode="markers",
name="p",
marker=dict(size=11, color=qualitative.Set2[0], symbol="star"),
)
)
# Circle
cx, cy = circle_xy(p, eps_demo)
fig.add_trace(go.Scatter(x=cx, y=cy, mode="lines", name="ε-ball", line=dict(color="black", width=1)))
fig.update_layout(title="An ε-neighborhood is a circle around p", legend=dict(orientation="h"))
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
def point_types(X: np.ndarray, eps: float, min_samples: int) -> np.ndarray:
"""Return an array of strings: 'core' / 'border' / 'noise' for each point."""
X = np.asarray(X)
eps2 = float(eps) ** 2
sq = np.sum(X**2, axis=1, keepdims=True)
sq_dists = sq + sq.T - 2 * (X @ X.T)
sq_dists = np.maximum(sq_dists, 0.0)
nbr_counts = np.sum(sq_dists <= eps2, axis=1)
is_core = nbr_counts >= min_samples
# border: non-core point that is in the eps-neighborhood of at least one core point
is_border = (~is_core) & (np.any((sq_dists <= eps2) & is_core[None, :], axis=1))
is_noise = (~is_core) & (~is_border)
types = np.empty(X.shape[0], dtype=object)
types[is_core] = "core"
types[is_border] = "border"
types[is_noise] = "noise"
return types
eps_demo = 0.25
min_samples_demo = 6
types = point_types(X, eps=eps_demo, min_samples=min_samples_demo)
fig = px.scatter(
x=X[:, 0],
y=X[:, 1],
color=types,
category_orders={"color": ["core", "border", "noise"]},
color_discrete_map={"core": qualitative.Set2[0], "border": qualitative.Set2[1], "noise": "lightgray"},
title=f"Core / border / noise given ε={eps_demo}, min_samples={min_samples_demo}",
)
fig.update_traces(marker=dict(size=6, opacity=0.9))
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
3) Algorithm steps: density reachability + cluster expansion#
DBSCAN turns local density into clusters using reachability.
Density reachability#
A point `q` is directly density-reachable from `p` if `q ∈ Nε(p)` and `p` is a core point.
A point `q` is density-reachable from `p` if there exists a chain `p → p1 → p2 → ... → q` where each step is directly density-reachable.
Intuition: you can “walk” from p to q while always stepping inside ε-neighborhoods of core points.
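The “direct” step can be written as a tiny predicate. A minimal sketch (assuming Euclidean distance; `directly_reachable` is an illustrative helper, not something used later in the notebook):

```python
import numpy as np

def directly_reachable(X: np.ndarray, p_idx: int, q_idx: int, eps: float, min_samples: int) -> bool:
    """True if X[q_idx] is directly density-reachable from X[p_idx]."""
    dists = np.linalg.norm(X - X[p_idx], axis=1)
    in_ball = dists <= eps                    # is each point inside Nε(p)?
    p_is_core = in_ball.sum() >= min_samples  # counts p itself, like min_samples
    return bool(in_ball[q_idx] and p_is_core)

# Three points on a line, 0.5 apart
X_demo = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]])
print(directly_reachable(X_demo, 0, 1, eps=0.6, min_samples=2))  # True: q in the ball, p is core
print(directly_reachable(X_demo, 0, 2, eps=0.6, min_samples=2))  # False: q is outside the ball
```

Density reachability is then just this predicate chained along a path of core points.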
Cluster expansion (the algorithmic idea)#
Pick an unvisited point.
If it’s not core, it can’t start a cluster (it might later become a border point).
If it is core, start a new cluster and grow it by repeatedly adding neighbors.
You can think of this as BFS/DFS over a graph:
vertices are points
an edge exists if two points are within ε
but only core points are allowed to “expand” the frontier
Pseudocode (classic DBSCAN)#
labels = UNASSIGNED
cluster_id = 0
for each point p:
if p is visited: continue
mark p visited
neighbors = Nε(p)
if |neighbors| < min_samples:
label p as NOISE # might be relabeled later
else:
start new cluster C = cluster_id
expand(C, p, neighbors)
cluster_id += 1
expand(C, p, neighbors):
label p as C
seed_set = neighbors
while seed_set not empty:
q = pop(seed_set)
if q not visited:
mark q visited
q_neighbors = Nε(q)
if |q_neighbors| >= min_samples: # q is core
seed_set = seed_set ∪ q_neighbors
if q is UNASSIGNED or NOISE:
label q as C
Two small details to notice:
Border points can be labeled as noise initially, then later pulled into a cluster.
Border points within ε of core points from more than one cluster get whichever cluster reaches them first, so their assignment can depend on traversal order.
4) DBSCAN from scratch (NumPy)#
For learning, we’ll implement the algorithm with a simple (dense) distance matrix. This is O(n²) memory/time, which is fine for small demos.
scikit-learn uses spatial indexing (KD-tree / ball tree) when possible for faster neighborhood queries.
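As a taste of what that looks like, `sklearn.neighbors.NearestNeighbors` (already imported above) can answer all ε-neighborhood queries with a tree instead of a dense matrix. A small sketch on synthetic data (`X_demo` is just for this demo):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))

# Tree-backed radius query: one index array per point, without
# materializing the full 500 x 500 distance matrix
nn = NearestNeighbors(radius=0.3).fit(X_demo)
neighbors = nn.radius_neighbors(X_demo, return_distance=False)
print(len(neighbors))  # 500: one neighbor array per point
```

Exercise 3 at the end builds on this idea.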
def dbscan_from_scratch(X: np.ndarray, eps: float, min_samples: int) -> np.ndarray:
X = np.asarray(X, dtype=float)
n = X.shape[0]
eps2 = float(eps) ** 2
# Squared Euclidean distance matrix via (x-y)^2 = x^2 + y^2 - 2 x·y
sq = np.sum(X**2, axis=1, keepdims=True)
sq_dists = sq + sq.T - 2 * (X @ X.T)
sq_dists = np.maximum(sq_dists, 0.0)
neighbors = [np.flatnonzero(sq_dists[i] <= eps2) for i in range(n)]
is_core = np.array([len(nbrs) >= min_samples for nbrs in neighbors], dtype=bool)
labels = np.full(n, -1, dtype=int) # -1 = noise/unassigned
visited = np.zeros(n, dtype=bool)
cluster_id = 0
for p in range(n):
if visited[p]:
continue
visited[p] = True
if not is_core[p]:
continue
# Start a new cluster
labels[p] = cluster_id
seed_set = set(neighbors[p].tolist())
seed_set.discard(p)
while seed_set:
q = seed_set.pop()
if not visited[q]:
visited[q] = True
if is_core[q]:
seed_set.update(neighbors[q].tolist())
# If q is unassigned/noise, assign it to this cluster
if labels[q] == -1:
labels[q] = cluster_id
cluster_id += 1
return labels
def plot_clusters_2d(X: np.ndarray, labels: np.ndarray, title: str) -> go.Figure:
X = np.asarray(X)
labels = np.asarray(labels)
label_text = np.where(labels == -1, "noise", labels.astype(str))
fig = px.scatter(
x=X[:, 0],
y=X[:, 1],
color=label_text,
color_discrete_map={"noise": "lightgray"},
title=title,
)
fig.update_traces(marker=dict(size=6, opacity=0.9))
fig.update_yaxes(scaleanchor="x", scaleratio=1)
return fig
eps = 0.25
min_samples = 6
labels_scratch = dbscan_from_scratch(X, eps=eps, min_samples=min_samples)
labels_sklearn = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
ari = adjusted_rand_score(labels_sklearn, labels_scratch)
print(f"Agreement with sklearn (ARI): {ari:.3f}")
plot_clusters_2d(X, labels_scratch, title="DBSCAN from scratch (NumPy)").show()
Agreement with sklearn (ARI): 1.000
5) Plotly lab: effect of eps and min_samples#
DBSCAN has two knobs that shape everything:
`eps` (ε): how far you consider “nearby”
`min_samples`: how many neighbors you need to be a “crowd”
If you remember one mental model:
Increase `eps` → neighborhoods connect more easily → clusters merge, less noise.
Increase `min_samples` → harder to be core → more noise, clusters can fragment.
def describe_labels(labels: np.ndarray) -> tuple[int, float]:
labels = np.asarray(labels)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_frac = float(np.mean(labels == -1))
return n_clusters, noise_frac
eps_values = [0.18, 0.25, 0.35]
min_samples_fixed = 6
for eps in eps_values:
labels = DBSCAN(eps=eps, min_samples=min_samples_fixed).fit_predict(X)
n_clusters, noise_frac = describe_labels(labels)
title = f"Vary ε: eps={eps}, min_samples={min_samples_fixed} | clusters={n_clusters}, noise={noise_frac:.1%}"
plot_clusters_2d(X, labels, title=title).show()
eps_fixed = 0.25
min_samples_values = [3, 6, 12]
for m in min_samples_values:
labels = DBSCAN(eps=eps_fixed, min_samples=m).fit_predict(X)
n_clusters, noise_frac = describe_labels(labels)
title = f"Vary min_samples: eps={eps_fixed}, min_samples={m} | clusters={n_clusters}, noise={noise_frac:.1%}"
plot_clusters_2d(X, labels, title=title).show()
6) Plotly lab: noise detection + choosing eps#
DBSCAN’s noise label is often the feature you want the most.
It can act like a simple outlier detector.
It can keep a clustering result honest: “these points don’t really belong anywhere.”
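In code, the outlier detector is just a boolean mask on the labels. A minimal sketch (synthetic data; the blob-plus-outlier setup is only for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# One tight blob plus a single far-away point
X_demo, _ = make_blobs(n_samples=100, centers=[[0.0, 0.0]], cluster_std=0.3, random_state=0)
X_demo = np.vstack([X_demo, [[5.0, 5.0]]])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)
outlier_mask = labels == -1  # -1 is scikit-learn's noise label
print("noise points found:", int(outlier_mask.sum()))  # the far-away point is among them
```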
A common heuristic for choosing eps is the k-distance plot:
Pick `k = min_samples`.
For each point, compute the distance to its k-th nearest neighbor.
Sort those distances; look for the “elbow”.
The elbow is roughly where points transition from “in crowds” to “in sparse regions”.
def k_distance_plot(X: np.ndarray, k: int) -> tuple[go.Figure, np.ndarray]:
nn = NearestNeighbors(n_neighbors=k)
nn.fit(X)
dists, _ = nn.kneighbors(X)
kth = np.sort(dists[:, -1])
fig = px.line(
y=kth,
title=f"{k}-distance plot (sorted distances to the {k}th nearest neighbor)",
labels={"x": "points (sorted)", "y": f"distance to {k}th nearest neighbor"},
)
return fig, kth
k = 6
fig, kth = k_distance_plot(X, k=k)
fig.show()
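If you want a number instead of an eyeball, one crude way to locate the elbow is the point of maximum distance from the chord joining the two ends of the sorted curve. This is a sketch, not a robust knee detector; `elbow_eps` is an illustrative helper:

```python
import numpy as np

def elbow_eps(kth_sorted: np.ndarray) -> float:
    """Crude elbow pick: the curve point farthest from the straight line
    joining the first and last points of the sorted k-distance curve."""
    y = np.asarray(kth_sorted, dtype=float)
    x = np.arange(len(y), dtype=float)
    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]
    # Perpendicular distance of each (x, y) to the chord
    num = np.abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
    den = np.hypot(y1 - y0, x1 - x0)
    return float(y[np.argmax(num / den)])

# Synthetic sorted curve with an obvious knee at 0.1
curve = np.concatenate([np.full(90, 0.1), np.linspace(0.1, 2.0, 10)])
print(f"suggested eps ≈ {elbow_eps(curve):.2f}")  # → suggested eps ≈ 0.10
```

Treat the suggestion as a starting point and always sanity-check it against the plot.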
# Noise detection demo: add a few extreme outliers
outliers = rng.uniform(low=[-3.5, -3.5], high=[3.5, 3.5], size=(12, 2))
X_out = np.vstack([X, outliers])
eps = 0.25
min_samples = 6
labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_out)
n_clusters, noise_frac = describe_labels(labels)
title = f"DBSCAN noise detection | eps={eps}, min_samples={min_samples} | clusters={n_clusters}, noise={noise_frac:.1%}"
plot_clusters_2d(X_out, labels, title=title).show()
7) Comparison: DBSCAN vs k-means vs HDBSCAN#
A quick way to remember the tradeoffs:
| Method | Needs `k`? | Finds non-convex shapes? | Labels noise? | Handles varying density well? |
|---|---|---|---|---|
| k-means | ✅ | ❌ | ❌ | ❌ |
| DBSCAN | ❌ | ✅ | ✅ | ⚠️ (often struggles) |
| HDBSCAN | ❌ | ✅ | ✅ | ✅ |
k-means: great for roughly spherical clusters; fast; but forces every point into a cluster.
DBSCAN: great when clusters are “dense blobs/curves” and you want noise detection.
HDBSCAN: like DBSCAN, but it builds a hierarchy over density scales and picks stable clusters — often better when density varies.
# Same dataset, k-means vs DBSCAN
labels_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
plot_clusters_2d(X, labels_kmeans, title="k-means (k=2) on two moons").show()
labels_db = DBSCAN(eps=0.25, min_samples=6).fit_predict(X)
plot_clusters_2d(X, labels_db, title="DBSCAN on two moons (eps=0.25, min_samples=6)").show()
# Optional: HDBSCAN (not installed by default in many environments)
try:
import hdbscan # type: ignore
except Exception:
hdbscan = None
if hdbscan is None:
print(
"HDBSCAN not installed. If you want to run this section locally, install with:\n"
" pip install hdbscan\n"
"or (often easier):\n"
" conda install -c conda-forge hdbscan"
)
else:
clusterer = hdbscan.HDBSCAN(min_cluster_size=20, min_samples=6)
labels_h = clusterer.fit_predict(X)
plot_clusters_2d(X, labels_h, title="HDBSCAN on two moons (auto density scale)").show()
HDBSCAN not installed. If you want to run this section locally, install with:
pip install hdbscan
or (often easier):
conda install -c conda-forge hdbscan
8) scikit-learn DBSCAN: practical parameter map#
Key parameters in sklearn.cluster.DBSCAN:
`eps`: distance threshold (your most important knob)
`min_samples`: how “crowded” you require a core point to be
`metric`: distance metric (`"euclidean"` by default, but others are available)
`n_jobs`: parallelism for neighbor search (depends on the sklearn version)
Practical pattern: scale features before DBSCAN unless you already trust the units.
# Scaling matters: eps is in *distance units*
X_weird = X.copy()
X_weird[:, 0] *= 10.0 # stretch x-axis
labels_unscaled = DBSCAN(eps=0.25, min_samples=6).fit_predict(X_weird)
plot_clusters_2d(X_weird, labels_unscaled, title="DBSCAN on stretched data (no scaling)").show()
X_scaled = StandardScaler().fit_transform(X_weird)
labels_scaled = DBSCAN(eps=0.25, min_samples=6).fit_predict(X_scaled)
plot_clusters_2d(X_scaled, labels_scaled, title="DBSCAN after StandardScaler (distance scale restored)").show()
9) Practical checklist + pitfalls#
Scaling is not optional when features have different units (DBSCAN uses distances directly).
Choosing `eps` is the main art: k-distance plots help, but don’t over-trust the elbow.
what counts as “near” depends on the metric and the embedding space.
Varying density is DBSCAN’s classic failure mode (one ε can’t fit all). Consider HDBSCAN.
High dimensions make distance less informative (neighbors become “all the same distance”).
Border assignment ambiguity: border points can be assigned to different clusters depending on traversal order.
Complexity: naive DBSCAN is O(n²); practical implementations rely on efficient neighbor search.
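The high-dimension pitfall is easy to verify numerically: as `d` grows, the farthest and nearest neighbors of a point end up at nearly the same distance, so any single `eps` separates almost nothing. A quick sketch on uniform random data (names like `ratios` are just for this demo):

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for d in [2, 10, 100, 1000]:
    Xd = rng.uniform(size=(200, d))
    # Distances from the first point to all the others
    dists = np.linalg.norm(Xd[1:] - Xd[0], axis=1)
    ratios[d] = dists.max() / dists.min()
    print(f"d={d:4d}: farthest/nearest distance ratio = {ratios[d]:.2f}")
```

The ratio drifts toward 1 as `d` grows, which is exactly the “all the same distance” effect.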
Exercises#
On the moons dataset, find an `eps` where DBSCAN merges both moons into one cluster. Explain why.
Create a dataset with one dense blob and one sparse blob. Show where DBSCAN fails and explain how HDBSCAN helps.
Implement a faster neighbor query (KD-tree / ball tree) and compare runtime vs the O(n²) version.
Use DBSCAN on a PCA-reduced real dataset (e.g., digits) and inspect the noise points.
References#
Ester, Kriegel, Sander, Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise (DBSCAN). (KDD)
scikit-learn docs: DBSCAN (`sklearn.cluster.DBSCAN`)
Campello, Moulavi, Sander (2013). Density-Based Clustering Based on Hierarchical Density Estimates (HDBSCAN). (PAKDD)